Abstract. We study the fame distribution of scientists and other social groups as measured by the number of Google hits 
garnered by individuals in the population. Past studies have found that the fame distribution decays either in power-law [1] 
or exponential [2] fashion, depending on whether individuals in the social group in question enjoy true fame or not. In our 
present study we examine critically Google counts as well as the methods of data analysis. While the previous findings are 
corroborated in our present study, we find that, in most situations, the data available does not allow for sharp conclusions. 



1. INTRODUCTION 

The concept of fame within a population has critical social
and economic impact. Recently, the idea of using the
number of hits returned from a search of a person's name 
on Google as a means of quantifying that person's fame 
has been explored [1, 2]. A seminal paper explored the
fame of a unique population, that of World War I "ace"
pilots [1], and found, among other things, a power-law
decay in the tail of the distribution. More recent work [2]
has applied this to a population of scientists who have
published on the cond-mat e-print archive¹. The tail of
their fame distribution was best fit by an exponential. 
On the other hand, the fame of other populations was 
found to follow a power-law decay. The difference was 
attributed to the fact that scientists habitually use the 
World Wide Web as a professional means of communi- 
cation and cite each other on the web in relation to their 
published work. 

Google's goal as a service is to provide accurate search
results to its users. For the purposes of determining a
subject's fame, however, what matters most is not that
accurate results are listed first, as it is for most users,
but that the count of those results is accurate.
Unfortunately, Google's counts are not accurate enough,
for several reasons [3].

Google acknowledges that the hits count given is an
estimate, but does not elaborate on the accuracy of this
estimation nor reveal how it is calculated. It seems
reasonable to assume that very small counts are more
accurate than larger ones. This means that the error is
largest in the tail of the fame distribution, and it is this
region that is of most interest. In addition, the tail of the
distribution is more likely to contain results that are
over-counted, further compounding the error.

In [1], over-counting was prevented by verifying each
hit by hand, a time-consuming procedure that limited
the sample size. At the time of this writing, Google
only returns the first 1000 hits, so it is impossible to
verify the accuracy of any results beyond that number,
and one must trust Google's estimation. Even manual
verification is therefore limited.

The previous searches in [1, 2] used a search lexicon
including the boolean OR operator. We have since
found out that Google returns incorrect hit counts when 
OR is used @|. For a simple illustration, a search for 
cars OR automobiles returns 80.5 million hits (at 
the time of this writing) while searches for cars and 
automobiles return 94.2 million and 8.82 million 
hits, respectively, violating basic set theory. Thus, the 
previous work must be reproduced using a better lexicon. 
In the current work, all our searches avoid the problematic
OR operator. See Table 1.
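
The inconsistency described above can be checked mechanically. The following is a minimal sketch and not part of the original study: the function name `or_count_is_consistent` is hypothetical, and the numbers are the hit counts quoted in the text (in millions). For any two result sets, the count for A OR B can be no smaller than the larger individual count and no larger than the sum of the two.

```python
def or_count_is_consistent(hits_a, hits_b, hits_a_or_b):
    """Check the set-theoretic bounds on a union count.

    For result sets A and B: max(|A|, |B|) <= |A or B| <= |A| + |B|.
    """
    lower_bound = max(hits_a, hits_b)
    upper_bound = hits_a + hits_b
    return lower_bound <= hits_a_or_b <= upper_bound


# Hit counts quoted in the text, in millions.
cars = 94.2
automobiles = 8.82
cars_or_automobiles = 80.5

print(or_count_is_consistent(cars, automobiles, cars_or_automobiles))  # False
```

The reported OR count falls below the count for cars alone, so the lower bound is violated.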

Despite these issues, Google still provides an excel- 
lent tool for research. It is the simplest means of getting 
the most information available and it commands a very 
large sample space. For example, the work in 0] uses 
hit counts to "teach" the semantic meaning of words to 
software — a central problem in Artificial Intelligence. 
Related words such as 'painter' and 'artist' will have 
many more joint occurrences than disparate words, such 
as 'plumber' and 'artist', leading to higher hit counts. 
Their work confirms that Google yields reasonable re- 
sults when avoiding the OR and using the AND operator 
only. 

Google has been generous enough to open their search
interface to allow tools to be created that can perform
Google searches automatically². We have used this to
eliminate the laborious task of entering single queries
and recording the hits count. Larger populations can be
searched much more quickly using an automatic tool. For
the present work, we used a script that performed the
searches from a web server. An easier way still is with the
open-source PyGoogle package³, which integrates the
Google search interface with the Python programming
language.

In both [1] and [2], fits were performed by binning the
search results with exponentially-sized bins and then fit- 
ting to the binned data using least-squares. A better tech- 
nique than binning, when working with sparse data, is 
examining cumulative distributions @|, as we do in the 
present work. Also, it has been shown that there are problems
with using least-squares fits to logarithmic plots [6].
One problem is that the log operation magnifies the error 
in the tail. Least-squares fitting assumes that errors for
each data point are uniform and will not properly weight
the noisier tail. In this work, we use a more robust
technique, that of Maximum Likelihood, to achieve less
biased fits. See section 2.

All distributions studied here exhibit a power-law tail,
although for many the tail covers a very narrow range.
For the scientist populations, we observe a power-law
tail only in the top 12% of the data. Most of the data for
scientists is best fit by an exponential, just as found in the
previous study [2]. In contrast, other populations do not
fit to an exponential over any sizable range. See section 4.



2. MAXIMUM LIKELIHOOD FITTING 

A better technique than least-squares fitting of binned
data for determining the parameter(s) of a probability
distribution from sampled data is that of Maximum
Likelihood. The results are more robust in terms of
error weighting. This is a very common technique and is
covered in many statistics and regression texts.

To briefly illustrate how maximum likelihood works,
let us derive the Maximum Likelihood Estimator (MLE)
for λ, the parameter for an exponential probability
distribution:

$$ P(x) = \lambda e^{-\lambda x} \tag{1} $$



The goal of Maximum Likelihood is to find the most
likely λ given the existing data. For this, we start with
the probability of the experiment given λ, assuming
independence of the data points:

$$ P(x_1, x_2, \ldots, x_N \mid \lambda) = \prod_{i=1}^{N} \lambda e^{-\lambda x_i} \tag{2} $$

$$ = \lambda^N \exp\left( -\lambda \sum_{i=1}^{N} x_i \right) \tag{3} $$

where x_i is the unbinned data gathered from the
experiment. This function is the total probability of all
measurements occurring in the experiment. From this, we
define the likelihood function, using Bayes' Theorem:

$$ l(\lambda \mid x_1, x_2, \ldots, x_N) = P(x_1, x_2, \ldots, x_N \mid \lambda) \, \frac{P(\lambda)}{P(x_1, x_2, \ldots, x_N)} \tag{4} $$



This is the likelihood of λ given the experimental data.
Assuming P(x_1, x_2, ..., x_N) = 1 (the experiment has
already occurred) and P(λ) is uniformly distributed (all
λ's are equally likely), then l(λ|x) ∝ P(x|λ). To find
the most likely λ, we must find the maximum of this
function with respect to the parameter λ. To simplify the
calculation, we will instead maximize the log-likelihood
function, L, which is equivalent:

$$ L = \ln(l) = N \ln \lambda - \lambda \sum_{i=1}^{N} x_i \tag{5} $$

$$ \frac{dL}{d\lambda} = \frac{N}{\lambda} - \sum_{i=1}^{N} x_i = 0 \tag{6} $$

$$ \lambda = \frac{N}{\sum_{i=1}^{N} x_i} \tag{7} $$



This is just the inverse of the mean, exactly as expected
for an exponential distribution. We need not account for
the proportionality between l(λ|x) and P(x|λ) because
we only used the derivative of L.
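
As an illustration, the estimator derived above (the inverse of the sample mean) can be sketched in a few lines of Python. This is a minimal example under stated assumptions, not the code used in this work: the function name `exponential_mle` is hypothetical, and the data are assumed to be positive, unbinned values.

```python
def exponential_mle(samples):
    """Maximum likelihood estimate of the exponential rate parameter.

    Implements lambda = N / sum(x_i), i.e. the inverse of the sample mean.
    """
    if not samples:
        raise ValueError("at least one data point is required")
    total = sum(samples)
    if total <= 0:
        raise ValueError("the data must have a positive sum")
    return len(samples) / total


# Three data points with mean 2.0 give a rate of 0.5.
print(exponential_mle([1.0, 2.0, 3.0]))  # 0.5
```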

The other probability distribution we are concerned
with is the power-law distribution:

$$ P(x) \propto x^{-\gamma} \tag{8} $$



For the MLE of γ, we reproduce the derivation given in
[3]. The first step is to normalize Eq. (8) for the given
data points:

$$ P(x) = C x^{-\gamma} = \frac{\gamma - 1}{x_{\min}} \left( \frac{x}{x_{\min}} \right)^{-\gamma} \tag{9} $$



where C is a constant of proportionality and x_min is the
smallest data point from the given sample. This is then
used to get the probability of the experiment:

$$ P(x_1, x_2, \ldots, x_N \mid \gamma) = \prod_{i=1}^{N} P(x_i) = \prod_{i=1}^{N} \frac{\gamma - 1}{x_{\min}} \left( \frac{x_i}{x_{\min}} \right)^{-\gamma} \tag{10} $$



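This is proportional to l(γ|x) as before:

$$ l(\gamma \mid x_1, x_2, \ldots, x_N) = \prod_{i=1}^{N} \frac{\gamma - 1}{x_{\min}} \left( \frac{x_i}{x_{\min}} \right)^{-\gamma} \tag{11} $$

Again, we work with L = ln l, which is equivalent for
finding the most likely γ. Then:

$$ L = \sum_{i=1}^{N} \left( \ln(\gamma - 1) - \ln x_{\min} - \gamma \ln \frac{x_i}{x_{\min}} \right) \tag{12} $$

$$ = N \ln(\gamma - 1) - N \ln x_{\min} - \gamma \sum_{i=1}^{N} \ln \frac{x_i}{x_{\min}} \tag{13} $$

The MLE for γ can then be found:

$$ \frac{dL}{d\gamma} = \frac{N}{\gamma - 1} - \sum_{i=1}^{N} \ln \frac{x_i}{x_{\min}} = 0 \tag{14} $$

$$ \gamma = 1 + N \left( \sum_{i=1}^{N} \ln \frac{x_i}{x_{\min}} \right)^{-1} \tag{15} $$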
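
A minimal sketch of the power-law estimator derived above, γ = 1 + N / Σ ln(x_i / x_min), is given here for illustration only. The function name `power_law_mle` and the optional `x_min` argument are assumptions introduced for this example; by default x_min is the smallest data point, as in the text, and passing a larger value restricts the fit to the tail at or above it.

```python
import math


def power_law_mle(samples, x_min=None):
    """Maximum likelihood estimate of the power-law exponent gamma.

    Implements gamma = 1 + N / sum(ln(x_i / x_min)) over the points
    with x_i >= x_min. If x_min is omitted, the smallest point is used.
    """
    if x_min is None:
        x_min = min(samples)
    if x_min <= 0:
        raise ValueError("x_min must be positive")
    tail = [x for x in samples if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum <= 0:
        raise ValueError("need at least one point strictly above x_min")
    return 1.0 + len(tail) / log_sum


# Hypothetical hit counts, used only to show the call.
hits = [120, 45, 300, 18, 75, 2200, 33]
print(power_law_mle(hits))
print(power_law_mle(hits, x_min=45))  # fit restricted to counts >= 45
```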

Maximum Likelihood derives an estimator for a
distribution's parameter(s), regardless of whether the sampled
data truly does come from such a distribution. Hence,
one needs a way to test how well the estimator matches
the sample. For our purposes, the Kolmogorov-Smirnov
(KS) Test works quite well [6]. This test compares the
cumulative distribution function (CDF) of the
hypothesized probability distribution to the empirical CDF of the
sampled data. The test statistic is:

$$ K = \sup_{x} \left| F(x) - S(x) \right|, \tag{16} $$

where F(x) is the hypothesized CDF and S(x) is the
empirical CDF. K is then compared with a critical value
(for the given significance level) which can be found in
a table or generated by software. MATLAB's Statistics
Toolbox has a built-in KS-Test function, kstest().
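
For readers without MATLAB, the test statistic K can be computed directly. The sketch below is illustrative and makes several assumptions: all function names are hypothetical, the two CDFs follow from the normalized densities used in this section, and the critical value uses the common large-sample approximation 1.36 / sqrt(N) for a significance level of 0.05. For small samples, a table or a statistics package should be preferred.

```python
import math


def ks_statistic(samples, cdf):
    """Return K = sup |F(x) - S(x)| for a hypothesized CDF F.

    The empirical CDF S jumps from i/n to (i+1)/n at the i-th sorted
    point, so both sides of each jump are compared with F.
    """
    xs = sorted(samples)
    n = len(xs)
    k = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        k = max(k, abs(f - i / n), abs((i + 1) / n - f))
    return k


def exponential_cdf(lam):
    """CDF of the density lam * exp(-lam * x), for x >= 0."""
    return lambda x: 1.0 - math.exp(-lam * x)


def power_law_cdf(gamma, x_min):
    """CDF of the normalized power-law density, for x >= x_min."""
    return lambda x: 1.0 - (x / x_min) ** (1.0 - gamma)


def passes_ks_test(samples, cdf, coefficient=1.36):
    """Compare K with the approximate critical value (assumption:
    large-sample approximation, coefficient 1.36 for alpha = 0.05)."""
    critical_value = coefficient / math.sqrt(len(samples))
    return ks_statistic(samples, cdf) < critical_value


# Evenly spaced points tested against a uniform CDF on [0, 1].
print(ks_statistic([0.25, 0.5, 0.75], lambda x: x))  # 0.25
```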



3. POPULATIONS

We have been able to greatly expand upon the number
of searches performed compared to previous work. In
addition, due to the problems with the OR operator, we
have performed multiple searches of the same population
using progressively inclusive lexicons. Here we describe
the populations studied.

Scientists: Two populations of scientists were used in
this study. The smaller one (of size 449) is the same
population used in [2]. The larger population (of
size 1625) is a list of authors who have published
recently on cond-mat and was harvested using arXiv's
OAI XML feed⁴.

Aces: The population of 1851 aces contains the 393
German aces studied in [1] as well as all the listed
aces of other nationalities⁵.

Actors: The actors population contains 778 actors who
were born on the second or third of each month
between the years 1950 and 1955, as collected from
the archives of the Internet Movie Database⁶. These
selection criteria were chosen to ensure a mostly
uniform sample and to give all the chosen subjects
roughly the same career length.

Villains: The villains population was gathered from a
user-contributed list of antagonists from fictional
media⁷. This list contains both fictitious characters
and real people who have appeared in fictional
works. Since this list was generated by users, the
characters must already enjoy a substantial level of
popularity.

Programmers: Similar to the villains population, this
population was collected from a user-contributed
list of famous programmers⁸: people who have
made a large contribution to computing, the Internet,
etc., such as Tim Berners-Lee, who invented
the World Wide Web, and Bill Gates, a co-founder
of Microsoft. As with the villains, it seems safe to
assume that this population is "famous".

Clarkson Students: The students population was chosen
from Clarkson University's student directory. It
consists of all students (undergraduate and graduate)
whose last name contains the letter "e". This
criterion was chosen simply to make it easy to harvest
a large collection of names from the online student
directory. We assume this is a "non-famous"
population, in that the students are too young to
have amassed any real fame.

Runners: This population was used previously. The
original searches used the problematic OR operator
and are here reproduced without it.



4. RESULTS AND ANALYSIS 

Table 1 contains the power-law exponents and search
lexicons for the populations studied. Many of the power-law
exponents are ≈ 2, as first predicted in [1]. All
populations display a power-law tail, regardless of whether they
are "famous" or not. It should be pointed out, however,
that for some populations the range fitting a power-law is
extremely narrow, casting doubt on this interpretation. In
those cases, an exponential distribution may fit as well.
Most of the scientists distributions fit an exponential over
much of the "non-tail". See Table 2. The Clarkson students,
another population assumed to be non-famous, do not
fit to an exponential over such a range. This is further
evidence that the exponential distribution for scientists
stems from their use of the World Wide Web as a
professional means for disseminating research, rather than
from fame.

The power-law exponent tends to increase as the
restriction due to the lexicon increases. This is expected
because a more restrictive search will make high hit counts
less frequent, steepening the tail. Figure 1
contains rank / frequency plots for several populations
to illustrate this effect. The plots are proportional to the
empirical complementary CDF, P(X > x). Note that individual
searches which return zero hits are not shown, changing the
maximum rank between lexicons. This is most evident in the
Students population: the third lexicon is very restrictive
and many students garnered zero hits.
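
The construction of a rank / frequency plot can be sketched as follows. This is an illustrative example rather than the plotting code used for Figure 1, and the function name `rank_frequency` is hypothetical. Hit counts are sorted in decreasing order, zero-hit searches are dropped as described above, and each remaining count is paired with its rank.

```python
def rank_frequency(hit_counts):
    """Return (hits, rank) pairs for a rank / frequency plot.

    Searches with zero hits are excluded. Rank 1 is the largest count,
    so the rank of a point is proportional to the fraction of the
    sample with at least that many hits.
    """
    positive = sorted((h for h in hit_counts if h > 0), reverse=True)
    return [(hits, rank) for rank, hits in enumerate(positive, start=1)]


# Hypothetical hit counts; the zero-hit search is not plotted.
print(rank_frequency([5, 0, 20, 1]))  # [(20, 1), (5, 2), (1, 3)]
```

Plotting the pairs on logarithmic axes, with hits on the horizontal axis and rank on the vertical axis, gives the kind of plot described in the caption of Figure 1. Because zero-hit searches are dropped, the maximum rank depends on the lexicon.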

The proposed model in [1] was shown to have a power-law
exponent that approaches 2 asymptotically, from
above, as the number of relevant web pages citing the
population in question increases over time. The changes
in population size in Table 1 are due to progressively
restrictive lexicons and do not pertain to the same
phenomenon. On the other hand, we are unable to account
for the many instances of power-law exponents smaller
than 2 observed, as any reasonable extension of the
theory in [1] yields powers γ > 2.



5. CONCLUSIONS

A purely visual inspection of plots such as those in
Figure 1 may lead one to conclude that a search is
exponentially or power-law distributed, but this is misleading and
subjective. The eye will overweight the number of data
points in the tail, due to the logarithmic axes. Objective
hypothesis tests such as the KS-test must be used.

In addition to problems with hit estimation, OR, etc.,
the choice of a lexicon has a noticeable impact. In the rank
/ frequency plot for the aces population in Figure 1, the
second search shows a much cleaner tail, though again
this region contains fewer than 6% of the aces. All of these
factors make it difficult to test theories. For the size of
populations involved, Google hits have too much "noise"
to accurately distinguish distributions.



TABLE 1. MLE Power-Law Fits to Search Results. All fits pass the KS-test (α = 0.05).

* For example, Top 99 means that the fit was applied to only the 99 highest-ranked searches. Note that some 
sub-samples constitute less than 10 percent of the population and that the tail contains the noisiest, and therefore 
least reliable, data. 



TABLE 2. MLE Exponential Fits to Search Results. All fits pass the KS-test 
(α = 0.05) except for Scientists (1625) Search 4. 

Population (Size) | Search | Range

* The critical value (CV) that is compared to K. A distribution passes the KS-test when 
K < CV. 




FIGURE 1. Rank / Frequency plots for several populations. The horizontal axis is the number of Google hits and the vertical axis 
is the rank of the (sorted) data points. Note that straight lines (offset for clarity) will have slope −γ + 1. 



